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Abstract 


Efficient and expressive comparison of sequences is an essential procedure for learning with se- 
quential data. In this article we propose a generic framework for computation of similarity mea- 
sures for sequences, covering various kernel, distance and non-metric similarity functions. The 
basis for comparison is embedding of sequences using a formal language, such as a set of natu- 
ral words, k-grams or all contiguous subsequences. As realizations of the framework we provide 
linear-time algorithms of different complexity and capabilities using sorted arrays, tries and suffix 
trees as underlying data structures. 

Experiments on data sets from bioinformatics, text processing and computer security illustrate 
the efficiency of the proposed algorithms—enabling peak performances of up to 10° pairwise com- 
parisons per second. The utility of distances and non-metric similarity measures for sequences as 
alternatives to string kernels is demonstrated in applications of text categorization, network intru- 
sion detection and transcription site recognition in DNA. 


Keywords: string kernels, string distances, learning with sequential data 


1. Introduction 


Sequences of discrete symbols are one of the fundamental data representations in computer sci- 
ence. A great deal of applications—from search engines to document ranking, from gene finding to 
prediction of protein functions, from network surveillance tools to anti-virus programs—critically 
depend on analysis of sequential data. Providing an interface to sequential data is therefore an 
essential prerequisite for applications of machine learning in these domains. 

Machine learning algorithms have been traditionally designed for vectorial data—probably due 
to the availability of well-defined calculus and mathematical analysis tools. A large body of such 
learning algorithms, however, can be formulated in terms of pairwise relationships between objects, 
which imposes a much looser constraint on the type of data that can be handled. Thus, a powerful 
abstraction between algorithms and data representations can be established. 

The most prominent example of such abstraction is kernel-based learning (e.g., Müller et al., 
2001; Schédlkopf and Smola, 2002) in which pairwise relationships between objects are expressed 
by a Mercer kernel, an inner product in a reproducing kernel Hilbert space. Following the seminal 
work of Boser et al. (1992), various learning methods have been re-formulated in terms of kernels, 
such as PCA (Schdélkopf et al., 1998b), ridge regression (Cherkassky et al., 1999), ICA (Harmeling 
et al., 2003) and many others. Although the initial motivation for the “kernel trick” was to allow 
efficient computation of an inner product in high-dimensional feature spaces, the importance of an 
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abstraction from data representation has been quickly realized (e.g., Vapnik, 1995). Consequently 
kernel-based methods have been proposed for non-vectorial domains, such as analysis of images 
(e.g., Scholkopf et al., 1998a; Chapelle et al., 1999), sequences (e.g., Jaakkola et al., 2000; Watkins, 
2000; Zien et al., 2000) and structured data (e.g., Collins and Duffy, 2002; Gartner et al., 2004). 


Although kernel-based learning has gained significant attention in recent years, a Mercer kernel 
is only one of many possibilities for defining pairwise relationships between objects. Numerous 
applications exist for which relationships are defined as metric or non-metric distances (e.g., An- 
derberg, 1973; Jacobs et al., 2000; von Luxburg and Bousquet, 2004), similarity or dissimilarity 
measures (e.g., Graepel et al., 1999; Roth et al., 2003; Laub and Miiller, 2004; Laub et al., 2006) or 
non-positive kernel functions (e.g., Ong et al., 2004; Haasdonk, 2005). It is therefore imperative to 
address pairwise comparison of objects in a most general setup. 


The aim of this contribution is to develop a general framework for pairwise comparison of 
sequences. Its generality is manifested by the ability to handle a large number of kernel func- 
tions, distances and non-metric similarity measures. From considerations of efficiency, we focus 
on algorithms with linear-time asymptotic complexity in the sequence lengths—at the expense of 
narrowing the scope of similarity measures that can be handled. For example, we do not consider 
super-linear comparison algorithms such as the Levenshtein distance (Levenshtein, 1966) and the 
all-subsequences kernel (Lodhi et al., 2002). 


The basis of our framework is embedding of sequences in a high-dimensional feature space 
using a formal language, a classical tool of computer science for modeling semantics of sequences. 
Some examples of such languages have been previously used for string kernels, such as the bag- 
of-words, k-gram or contiguous-subsequence kernel. Our formalization allows one to use a much 
larger set of possible languages in a unified fashion, for example subsequences defined by a finite 
set of delimiters or position-dependent languages. A further advantage of embedding using formal 
languages is separation of embedding models from algorithms, which allows one to investigate 
different data structures to obtain optimal efficiency in practice. 


Several data structures have been previously considered for specific similarity measures, such 
as hash tables (Damashek, 1995), sorted arrays (Sonnenburg et al., 2007), tries (Leslie et al., 2002; 
Shawe-Taylor and Cristianini, 2004; Rieck et al., 2006), suffix trees using matching statistics (Vish- 
wanathan and Smola, 2004), suffix trees using recursive matching (Rieck et al., 2007) and suffix 
arrays (Teo and Vishwanathan, 2006). All of these data structures allow one to develop linear-time 
algorithms for computation of certain similarity measures. Most of them are also suitable for the 
general framework developed in this paper; however, certain trade-offs exist between their asymp- 
totic run-time complexity, implementation difficulty and restrictions on embedding languages they 
can handle. To provide an insight into these issues, we propose and analyze three data structures 
suitable for our framework: sorted arrays, tries and suffix trees with an extension to suffix arrays. 
The message of our analysis, supported by experimental evaluation, is that the choice of an optimal 
data structure depends on the embedding language to be used. 


This article is organized as followed: In Section 2 we review related work on sequence com- 
parison. In Section 3 we introduce a general framework for computation of similarity measures for 
sequences. Algorithms and data structures for linear-time computation are presented in Section 4. 
We evaluate the run-time performance and demonstrate the utility of distinct similarity measures in 
Section 5. The article is concluded in Section 6 
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2. Related Work 


Assessing the similarity of two sequences is a classical problem of computer science. First ap- 
proaches, the string distances of Hamming (1950) and Levenshtein (1966), originated in the domain 
of telecommunication for detection of erroneous data transmissions. The degree of dissimilarity be- 
tween two sequences is determined by computing the shortest trace of operations—insertions, dele- 
tions and substitutions—that transform one sequence into the other (Sankoff and Kruskal, 1983). 
Applications in bioinformatics motivated extensions and adaptions of this concept, for example 
defining sequence similarity in terms of local and global alignments (Needleman and Wunsch, 1970; 
Smith and Waterman, 1981). However, similarity measures based on the Hamming distance are re- 
stricted to sequences of equal length and measures derived from the Levenshtein distance (e.g., Liao 
and Noble, 2003; Vert et al., 2004), come at the price of computational complexity: No linear-time 
algorithm for determining the shortest trace of operations is currently known. One of the fastest 
exact algorithms runs in O(n?/logn) for sequences of length n (Masek and Patterson, 1980). 


A different approach to sequence comparison originated in the field of information retrieval with 
the vector space or bag-of-words model (Salton et al., 1975; Salton, 1979). Textual documents are 
embedded into a vector space spanned by weighted frequencies of contained words. The similarity 
of two documents is assessed by an inner-product between the corresponding vectors. This concept 
was extended to k-grams—k consecutive characters or words—in the domain of natural language 
processing and computer linguistic (e.g., Suen, 1979; Cavnar and Trenkle, 1994; Damashek, 1995). 
The idea of determining similarity of sequences by an inner-product was revived in kernel-based 
learning in the form of bag-of-words kernels (e.g., Joachims, 1998; Drucker et al., 1999; Joachims, 
2002) and various string kernels (e.g., Zien et al., 2000; Leslie et al., 2002; Vishwanathan and 
Smola, 2004). Moreover, research in bioinformatics and text processing advanced the capabilities of 
string kernels, for example, by considering gaps, mismatches and positions in sequences (e.g., Lodhi 
et al., 2002; Leslie et al., 2003; Leslie and Kuang, 2004; Rousu and Shawe-Taylor, 2005; Rätsch 
et al., 2007). The comparison framework proposed in this article shares the concept of embedding 
sequences with all of the above kernels, in fact most of the linear-time string kernels (e.g., Joachims, 
1998; Leslie et al., 2002; Vishwanathan and Smola, 2004) are enclosed in the framework. 


A further alternative for comparison of sequences are kernels derived from generative proba- 
bility models, such as the Fisher kernel (Jaakkola et al., 2000) and the TOP kernel (Tsuda et al., 
2002). Provided a generative model, for example a HMM trained over a corpus of sequences or 
modeled from prior knowledge, these kernel functions essentially correspond to inner-products of 
partial derivatives over model parameters. The approach enables the design of highly specific sim- 
ilarity measures which exploit the rich structure of generative models, for example, for prediction 
of DNA splice sites (Ratsch and Sonnenburg, 2004). The run-time complexity of the kernel com- 
putation, however, is determined by the number of model parameters, so that only simple models 
yield run-time linear in the sequence lengths. Moreover, obtaining a suitable parameter estimate for 
a probabilistic model can be difficult or even infeasible in practical applications. 


3. Similarity Measures for Sequential Data 


Before introducing the framework for computation of similarity measures, we need to establish 
some basic notation. A sequence x is a concatenation of symbols from an alphabet A. The set of 
all possible concatenations of symbols from A is denoted by .A* and the set of all concatenations 
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of length k by A*. A formal language L C A* is any set of finite-length sequences drawn from A 
(Hopcroft and Motwani, 2001). The length of a sequence x is denoted by |x| and the size of the 
alphabet by |A|. A contiguous subsequence w of x is denoted by w E x, a prefix of x by w Ep x 
and a suffix by w E, x. Alternatively, a subsequence w of x ranging from position į to j is referred 
to as x[1../]. 











3.1 Embedding Sequences using a Formal Language 


The basis for embedding of a sequence x is a formal language L, whose elements are sequences 
spanning an |L|-dimensional feature space. We refer to L as the embedding language and to a 
sequence w € L as a word of L. There exist numerous ways to define L reflecting particular aspects 
of application domains, yet we focus on three definitions that have been widely used in previous 
research: 


1. Bag-of-words. In this model, L corresponds to a set of words from a natural language. L can 
be either defined explicitly by providing a dictionary or implicitly by partitioning sequences 
according to a set of delimiter symbols D C A (e.g., Salton, 1979; Joachims, 2002). 


L = Dictionary (explicit), L = (A\ D)* (implicit). 


2. K-grams and blended k-grams. For the case of k-grams (in bioinformatics often referred 
to as k-mers), L is the set of all sequences of length k (e.g., Damashek, 1995; Leslie et al., 
2002). The model of k-grams can further be “blended” by considering all sequences from 
length j up to k (e.g., Shawe-Taylor and Cristianini, 2004). 


k 
L = A‘ (k-grams), L= U A' (blended k-grams). 


i=j 


3. Contiguous sequences. In the most general case, L corresponds to the set of all contiguous 
sequences or alternatively to blended k-grams with infinite k (e.g., Vishwanathan and Smola, 
2004; Rieck et al., 2007). 


CO 
L=A* or BELLAS 
i=l 


Note that the alphabet A in the embedding languages may also be composed of higher semantic 
constructs, such as natural words or syntactic tokens. In these cases a k-gram corresponds to k 
consecutive words or tokens, and a bag-of-words models could represent textual clauses or phrases. 

Given an embedding language L, a sequence x can be mapped into the | L|-dimensional feature 
space by calculating a function ¢,,(x) for every w € L appearing in x. The embedding function ® 
for a sequence x is given by 


D: x9 (Pw (X))wer with $,(x) := occ(w,x)-Wy () 


where occ(w, x) is the number of occurrences of w in the sequence x and W, a weighting assigned 
to individual words. Alternatively occ(w, x) may be defined as frequency, probability or binary flag 
for the occurrences of w in x. 
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While the choice and design of an embedding language L offer a large degree of flexibility, 
it is often necessary to refine the amount of contribution for each word w € L, for example it is 
a common practice in text processing to ignore stop words and terms that do not carry semantic 
content. In the embedding function (1) such refinement is realized by the weighting term W,,. The 
following three weighting schemes for defining W, have been proposed in previous research: 


1. Corpus dependent weighting. The weight W, is based on the occurrences of w in the corpus 
of sequences (see Salton et al., 1975). Most notable is the inverse document frequency (IDF) 
weighting, in which W, is defined over the number of documents N and the frequency d(w) 
of w in the corpus. 

Wp = log, N —log,d(w) +1. 


If occ(w, x) is the frequency of w in x, the embedding function (1) corresponds to the well- 
known term frequency and inverse document frequency (TF-IDF) weighting scheme. 


2. Length dependent weighting. The weight W, is based on the length |w| (see Shawe-Taylor 
and Cristianini, 2004; Vishwanathan and Smola, 2004), for example, so that longer words 
contribute more than shorter words to a similarity measure. A common approach is defining 
Wp using a decay factor 0 < 4 < 1. 


Wp = 2 Tlel 


3. Position dependent weighting. The weight W, is based on the position of w in x. Vish- 
wanathan and Smola (2004) propose a direct weighting scheme, in which W, is defined over 
positional weights W(k,x) for each position k in x as 


j 
Wo = Wat] = >) WE,x). 


An indirect approach to position dependent weighting can be implemented by extending the 
alphabet A with positional information to A= AxN, so that every element (a, k) € A of the 
extended alphabet is a pair of a symbol a and a position k. 


The introduced weighting schemes can be coupled to further refine the embedding based on 
L, for example, in text processing the impact of a particular term might be influenced by the term 
frequency, inverse document frequency and its length. 


3.2 Vectorial Similarity Measures for Sequences 


With an embedding language L at hand, we can now express common vectorial similarity measures 
in the domain of sequences. Table 1 and 2 list well-known kernel and distance functions (see Vapnik, 
1995; Schélkopf and Smola, 2002; Webb, 2002) in terms of L. The histogram intersection kernel 
in Table 1 derives from computer vision (see Swain and Ballard, 1991; Paone et al., pore) and the 
Jensen-Shannon divergence in Table 2 is defined using H (x, y 

A further and rather exotic class of vectorial similarity measures are lati) oeer (see 
Sokal and Sneath, 1963; Anderberg, 1973). These coefficients have been designed for comparison 
of binary vectors and often express non-metric properties. They are constructed using three summa- 
tion variables a,b and c, which reflect the number of matching components (1/1), left mismatching 
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Kernel k(x, y) 

Linear Diver Pw (¥) Pw Y) 
Polynomial (X jé 1 Pw(X)Pw(y) +4 ) = 
Sigmoidal tanh (© per opo (Y) +9) 
Gaussian exp (=) 


Histogram intersection 


X weL min(¢, (x) > Pw (y)) 





Table 1: Kernel functions for sequential data. 











Distance d(x,y) Distance d(x,y) 
Manhattan ae L lv X) — bo Y) Chebyshev mMaXpeL |Pw(X) — po (Y) 
g diste — Euer aieo Geodesic arccos Z wer Po Opu 0) 
Canberra pote L Ieee Hellinger? Xe $ (/ Pw (&) -— V bw (y))° 
Minkowski? = > ,,¢7, lw (X) — dw Y)” Jensen-Shannon wer H (bw (X), bo (y)) 





Table 2: Distance functions for sequential data. 


components (0/1) and right mismatching components (1/0) in two binary vectors. Common similar- 
ity coefficients are given in Table 3. For application to non-binary vectors the summation variables 
a,b,c can be extended in terms of an embedding language L (Rieck et al., 2006): 


a= » min(¢, (x), Pw (y)), 


weL 

b= >a [Po X) — min(d,, X), fo (y))], 
weL 

c= >a lw (y) = min(¢,, (x), Pw (y))] : 
weL 


The above definition of a matches the histogram intersection kernel k provided in Table 1, so that 
alternatively all summation variables can be expressed by 





a=k(x,y), b=k(x,x)—k(x,y), c=k(y,y) —k(x, y). (2) 
Sim. Coeff. s(x, y) Sim. Coeff. s(x, y) 
Simpson a/min(a+b,a+c) Kulczynski (1) a/(b+c) 
Jaccard a/(a+b+c) Kulczynski (2) S(a/(at+b)+a/(a +c)) 
Braun-Blanquet a/max(a+b,a+c) Otsuka, Ochiai a//(a+b)(a+c) 
Czekanowski, 2a/(2a+b+c) Sokal-Sneath, a/(a+2(b +c)) 


Sg@rensen-Dice Anderberg 





Table 3: Similarity coefficients for sequential data 
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Hence, one can consider the similarity coefficients given in Table 3 as variations of the histogram 
intersection kernel, for example, the Jaccard coefficient can be formulated solely in terms of k: 


a = k(x, y) 
a+b+c k(x,x)+k(y,y)—k(x,y) 





(x,y) = 


3.3 A Generic Framework for Similarity Measures 


All of the similarity measures introduced in the previous section share a similar mathematical con- 
struction: an inner component-wise function is aggregated over each dimension using an outer 
operator, for example, the linear kernel is defined as the sum of component-wise products and the 
Chebyshev distance as the maximum of component-wise absolute differences. 

One can exploit this shared structure to derive a unified formulation for similarity measures 
(Rieck et al., 2006, 2007), consisting of an inner function m and an outer operator ® as follows 


sx, y) = Pn (ou(X), po Y). (3) 


wel 


For convenience in later design of algorithms, we introduce a “multiplication” operator & which 
corresponds to executing the © operation k times. Thus, for any n € N and x € R, we define & as 


x8&n:=x9...Ðx. 
— 


n 


Given the unified form (3), kernel and distance functions presented in Table 1 and 2 can be 
re-formulated in terms of @ and m. Adaptation of similarity coefficients to the unified form (3) 
involves a re-formulation of the summation variables a, b and c. The particular definitions of outer 
and inner functions for the presented similarity measures are given in Table 4. The polynomial and 
sigmoidal kernels as well as the Geodesic distance are not shown since they can be expressed using 
a linear kernel. For the Chebyshev distance the operator ® represents the identity function, while 
for all other similarity measures it represents a multiplication. 














Kernel D m(x, y) Distance D m(x, y) 
Linear + x-y Manhattan + |x—y| 
Histogram inters. + min(x, y) 7° distance + («-y)?/(~4+y) 
Canberra + |x —yl/(x+y) 
Sim. Coef. p m(x,y) Minkowski? + |x — y|? 
Variable a + min(x, y) Chebyshev max |x—-y| 
Variable b + x-—min(x, y) Hellinger? + (/x — SYY 
Variable c + y—min(x, y) Jensen-Shannon + A (x,y) 








Table 4: Unified formulation of similarity measures. 


As a last step towards the development of comparison algorithms, we need to address the high 
dimensionality of the feature space induced by the embedding language L. The unified form (3) 
theoretically involves computation of m over all w € L, which is practically infeasible for most L. 
Yet the feature space induced by L is sparse, since a sequence x comprises only a limited number of 
contiguous subsequences—at most (|x|? + |x|)/2 subsequences. As a consequence of the sparseness 
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only very few terms ¢,,(x) and ¢,,(y) in the unified form (3) have non-zero values. We exploit this 
fact by defining m (0, 0) =e, where e is the neutral element for the operator ®, so that for any x € R 
holds 

x@e=x, eOx=x. 


By assigning m (0, 0) to e, the computation of a similarity measure can be reduced to cases where 
either p (X) > 0 or dy (y) > 0, as the term m(0, 0) does not affect the result of expression (3). We 
can now refine the unified form (3) by partitioning the similarity measures into conjunctive and 
disjunctive measures using an auxiliary function m: 


s(x, y) = i poa), do(y)). 


wel 


1. Conjunctive similarity measures. The inner function m only accounts pairwise non-zero 
components, so that for any x € R holds m(x,0) =e and m(0,x) =e. 


m(x,y) = 


m(x,y) ifx>OQOandy>0 
otherwise. 


Kernel functions fall into this class, except for the distance-based RBF kernel. By using a 
kernel to express similarity coefficients as shown in expression (2), similarity coefficients 
also exhibit the conjunctive property. 


2. Disjunctive similarity measures. The inner function m requires at least one component to 
be non-zero, otherwise m (0, 0) = e holds. 


big ifx >Oory>0 


m(x,y)= 
y) otherwise. 


Except for the Geodesic distance, all of the presented distances fall into this class. Depend- 
ing on the embedding language, this class is computational more expensive than conjunctive 
measures. 


As a result, any similarity measure, including those in Table 1, 2 and 3, composed of an inner 
and outer function can be applied for efficient comparison of embedded sequences, if (a) a neutral 
element e for the outer function @ exists and (b) the inner function m is either conjunctive or 
disjunctive, that is at least m (0, 0) = e holds. 


4. Algorithms and Data Structures 


We now introduce data structures and algorithms for efficient computation of the proposed similarity 
measures. In particular, we present three approaches differing in capabilities and implementation 
complexity covering simple sorted arrays, tries and generalized suffix trees. For each approach, 
we briefly present the involved data structure, provide a discussion of the comparison algorithm 
and give run-time bounds for extraction and comparison of embedded sequences. Additionally, we 
provide implementation details that improve run-time performance in practice. 

As an example running through this section we consider the two sequences x = abbaa and 
y = baaaab from the binary alphabet A = {a, b} and the embedding language of 3-grams, L = A’. 
For a data structure storing multiple words w € L of possibly different lengths, we denote by k the 
length of longest words. 
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4.1 Sorted Arrays 


Data structure. A simple and intuitive representation for storage of embedded sequences are 
sorted arrays or alternatively sorted lists (Joachims, 2002; Rieck et al., 2006; Sonnenburg et al., 
2007). Given an embedding language L and a sequence x, all words w € L satisfying w E x are 
maintained in an array X along with their embedding values ¢,,(x). Each field x of X consists of 
two attributes: the stored word word[x] and its embedding value phi[x]. The length of an array X 
is denoted by |X|. In order to support efficient comparison, the fields of X are sorted by contained 
words, for example, using the lexicographical order of the alphabet A. Figure 1 illustrates the sorted 
arrays of 3-grams extracted from the two example sequences x and y. 





word[x] philx] 








X e——abb | 1} —[baa | 1} —{bba | 1} 





























Y o——aaa | 2 aab | 1}—{baa | 1} 














Figure 1: Sorted arrays of 3-grams for x = abbaa and y = baaaab. The number in each field 
indicates the number of occurrences. 


Algorithm. Comparison of two sorted arrays X and Y is carried out by looping over the fields 
of both arrays in the manner of merging sorted arrays (Knuth, 1973). During each iteration the 
inner function m is computed over contained words and aggregated using the operator ®. The 
corresponding comparison procedure in pseudo-code is given in Algorithm 1. Herein, we denote 
the case where a word w is present in x and y as match and the case of w being contained in either 
X or y as mismatch. For run-time improvement, these mismatches can be ignored in Algorithm 1 if 
a conjunctive similarity measure is computed (cf. Section 3.3). 


Algorithm 1 Array-based sequence comparison 
1: function COMPARE(X, Y : Array): R 
2: see iel,jel 





3 while i < |X| or j < |Y | do 

4 x < X[i], ye Y[j] 

5 if y = NIL or word[x] < word[y] then œ> Mismatch at x 
6: s <—s @m/(phil[x], 0) 

7 ifitl 

8 else if x = NIL or word[x] > word[y] then > Mismatch at y 
9: s —s @m(0, phily]) 

10: J<jtl 

11: else > Match at x and y 
12: s <—s @m/(phi[x], phily]) 

13: ic¢it+l,jcjt+l 

14: return s 
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Run-time. The comparison algorithm based on sorted arrays is simple to implement, yet it does 
not enable linear-time comparison for all embedding languages, for example, if L = A*. However, 
sorted arrays enable linear-time similarity measures, if there exist O (|x|) words with w E x, which 
implies all w € L have no or constant overlap in x. Examples are the common bag-of-words and 
k-gram embedding languages. 

Under these constraints a sorted array is extracted from a sequence x in O(k|x|) time using 
linear-time sorting, for example, radix sort (Knuth, 1973), where k is the maximum length of all 
words w € L in x. The comparison algorithm requires at most |x| + |y| iterations, so that the worst- 
case run-time is O(k(|x|+|y|)). For extraction and comparison the run-time complexity is linear in 
the sequence lengths due to the constraint on constant overlap of words, which implies k|x| € O (|x|) 
for any k and x. 





Implementation notes. The simple design of the sorted array approach enables a very efficient 
extension. If we consider a CPU with registers of b bits, we restrict the maximum word length k, so 
that every word fits into a CPU register. This restriction enables storage and comparison operations 
to be performed directly on the CPU, that is operations on words w with |w| < k are executed in 
O(1) time. Depending on the size of the alphabet |A| and the CPU bits b, the maximum word 
length is |b/log, |A|]. In many practical applications one can strongly benefit from this extensions, 
as k is either bounded anyway, for example, for k-grams, or longer words are particular rare and 
do not increase accuracy significantly. For example on current 64 bit CPU architectures one can 
restrict k to 32 for DNA sequences with |A| = 4 and to k = 10 for textual documents with |A| < 64. 
Alternatively, embedded words may also be represented using hash values of b bits, which enables 
considering words of arbitrary length, but introduces a probability for hash collisions (Knuth, 1973). 

Another extension for computation of conjunctive measures using sorted arrays has been pro- 
posed by Sonnenburg et al. (2007). If two sequences x and y have unbalanced sizes |x| < [yl], 
one loops over the shorter sorted array X and performs a binary search procedure on Y, in fa- 
vor of processing both sorted arrays in parallel. The worst-case run-time for this comparison is 
O(k(|x| log, |y|)), so that one may automatically apply this extension if for two sequences x and y 
holds |x| log, ly] < [xl +lyl. 


4.2 Tries 


Data structure. A frie is a tree structure for storage and retrieval of sequences. The edges of a 
trie are labeled with symbols of A (Fredkin, 1960; Knuth, 1973). A path from the root to a marked 
node x represents a stored sequence, hereafter denoted by x. A trie node x contains a vector of size 
|A| linking to child nodes, a binary flag to indicate the end of a sequence mark[x] and an embedding 
value phi[x].! The i-th child node representing the i-th symbol in A is accessed via child[x,i]. If 
the i-th child is not present child[x ,i] = NIL. 

A sequences x is embedded using a trie X by storing all w € L with w E x and corresponding 
w(x) in X (Shawe-Taylor and Cristianini, 2004). Figure 2 shows tries of 3-grams for the two ex- 
ample sequences x and y. Note, that for the embedding language of k-grams considered in Figure 2 
all marked nodes are leaves, while for other embedding languages they may correspond to inner 
nodes, for example, for the case of blended k-grams, where every node in a trie marks the end of a 
sequence. 








1. For convenience, we assume that child nodes are maintained in a vector, while in practice sorted lists, balanced trees 
or hash tables may be preferred. 
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mark[x] ġ 
phi[x] © (1) (1) (2) (1) (1) 


Figure 2: Tries of 3-grams for x = abbaa and y = baaaab. The number in brackets at leaves 
indicate the number of occurrences. Marked nodes are squared. White nodes are implicit 
and not maintained in a compact trie representation. 


Algorithm. Comparison of two tries is performed as in Algorithm 2: Starting at the root nodes, 
one recursively traverses both tries in parallel. If the traversal passes at least one marked node the in- 
ner function m is computed as either a matching or mismatching word occurred (Rieck et al., 2006). 
To simplify presentation of the algorithm, we assume that mark[NIL] returns false and child[NIL, i] 
returns NIL. 


Algorithm 2 Trie-based sequence comparison 
1: function COMPARE(X, Y : Trie) : R 





2: s<e 

3: if X = NIL and Y = NIL then > Base case 
4: return s 

5: fori + 1,|A| do 

6: x <child[X,i], y < child[Y, i] 

T: if mark[x] and not mark[y] then > Mismatch at x 
8: s <—s @m/(phi[x], 0) 

9: if not mark[x] and mark[y] then > Mismatch at y 
10: s —s @m(0, phily]) 

11: if mark[x] and mark[y] then > Match at x and y 
12: s <—s @m/(phi[x], phily]) 

13: s <5 @COMPARE(x, y) 

14: return s 





Run-time. The trie-based approach enables linear-time similarity measures over a larger set of 
formal languages than sorted arrays. For tries we require all w € L with w E x to have either constant 
overlap in x or to be prefix of another word, for example, as for the blended k-gram embedding 
languages. 

To determine the run-time complexity on tries, we need to consider the following property: A 
trie storing n words of maximum length k has depth k and at most kn nodes. Thus, a sequence x 
containing O (|x|) words of maximum length k is embedded using a trie in O (k|x|) run-time. As an 
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invariant for the comparison procedure, the nodes x and y in Algorithm 2 stay at the same depth in 
each recursion. Hence, the comparison algorithm visits at most k|x| + k|y| nodes, which results in 
a worst-case run-time of O(k(|x| + |y])). The extraction and comparison run-time is linear in the 
sequence lengths, as we require words to either have constant overlap, which implies k|x| € O (|x|), 
or to be prefix of another word, which implies that both words share an identical path in the trie. 


Implementation notes. The first extension for the trie data structure is aggregation of embedding 
values in nodes. If in Algorithm 2 a mismatch occurs at node x, the algorithm recursively descends 
to all child nodes of x. At this point, however, it is clear that all nodes below x will also be 
mismatches, as all words w with x E, w are not present in the compared trie. This fact can be 
exploited by maintaining an aggregated value ø, at each node x given by 








px :=PBeo.®) with 1={weL|*C, w}. 


wel 


Instead of recursively descending at a mismatching node x, one uses øx to retrieve the aggregation 
of all lower embedding values. The extension allows disjunctive similarity measures to be computed 
as efficient as conjunctive measures at a worst-case run-time of O (k min([x|, |y])). 

The second extension originates from the close relation of tries and suffix trees. The nodes 
of a trie can be classified as implicit if they link to only one child node and as explicit otherwise. 
By iteratively removing implicit nodes and appending their labels to edges of explicit parent nodes 
one obtains a compact trie (cf. Knuth, 1973; Gusfield, 1997). Edges are labeled by subsequences 
encoded using indices i and j pointing to x[i..j]. The major benefit of compact tries is reduced 
space complexity, which decreases from O(k|x|) to O(|x|) independent of the maximum length k 
of stored words. 


4.3 Generalized Suffix Trees 


Data structure. A generalized suffix tree (GST) is a compact trie containing all suffixes of a set of 
sequences X4, . . . , X; (Gusfield, 1997). Every path in a GST from the root to a leaf corresponds to one 
suffix. A GST is obtained by extending each sequence x; with a delimiter $; ¢ A and constructing 
a suffix tree from the concatenation Z = x,$, ...x,$). 
For each GST node v we denote by children[v] the set of child 
S nodes, by length[v] the number of symbols on the incoming edge, by 
depth[v] the total number of symbols on the path from the root node to 
v and by phi[v,i] the number of suffixes of x; passing through node v. 
oe As every subsequence of x; is a prefix of some suffix, phi[v, i] reflects 
sad feng the occurrences (alternatively frequency or binary flag) for all subse- 
‘ ` quences terminating on the edge to v. An example of a GST is given in 
Figure 3. In the remaining part we focus on the case of two sequences x and y delimited by $; and 
$2, computation of similarity measures over a set of sequences being a straightforward extension. 





depth|v] 





Algorithm. Computation of similarity measures is carried out by traversing a GST in depth-first 
order (Rieck et al., 2007). An implementation in pseudo-code is given in Algorithm 3. At each 
node v the inner function m is computed using phi[v,1] and phi[v,2]. To account for different 
words along an edge and to support various embedding languages the function FILTER is employed, 
which selects appropriate contributions similar to the weighting introduced by Vishwanathan and 
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a $i `b ad $zbaa$ı 
aa 4 aan ane 
a” $, b$ $2 baa$, aab$, $ı 
(0,2) 
ab$2 b$2 


é 


Figure 3: Generalized suffix tree for x = abbaa$, and y = baaaab$,. The numbers in brackets at 
each inner node v correspond to philv, 1] and phi[v, 2]. Edges are shown with associated 
subsequences instead of indices. 


Smola (2004). At a node v the function takes Iength[v] and depth[v] of v as arguments to determine 
how much the node and its incoming edge contribute to the similarity measure, for example, for the 
embedding language of k-grams only nodes up to a path depth of k need to be considered. 


Algorithm 3 GST-based sequence comparison 

1: function COMPARE(X, Y : A*): R 

2: T <—CONCAT(X, Y) 

3 S © SUFFIXTREE(T) 

4 return TRAVERSE(root[S]) 

5: function TRAVERSE(v : Node): R 
6: sce 
7 
8 
9 





for c < children[v] do 


s <5 @ TRAVERSE (c) > Depth-first traversal 
n < FILTER(length[v], depth[v]) > Filter words on edge to v 
10: s <—s@m(phi[v, 1], philv,2])@n 


11: return s 





Algorithm 4 shows a filter function for k-grams. The filter returns 0 for all edges that do not 
correspond to a k-gram, either because they are too shallow or too deep in the GST, and returns 1 if 
a k-gram terminates on the edge to a node v. 


Algorithm 4 Filter function for k-grams, L = A* 
function FILTER(v : Node): N 
if depth[v] > k and depth[v] — length[v] < k then 
return | 
return 0 
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Another example of a filter is given in Algorithm 5. The filter implements the embedding 
language L = A*. The incoming edge to a node v contributes to a similarity measure by length[v], 
because exactly Jength[v] contiguous subsequences terminate on the edge to v. 


Algorithm 5 Filter function for all contiguous subsequences, L = A* 
function FILTER(v : Node): N 
return Jength[v] 








The bag-of-words model can be implemented either by encoding each word as a symbol of A or 
by augmenting nodes to indicate the presence of delimiter symbols on edges. Further definitions of 
weighting schemes for string kernels, which are suitable for Algorithm 3, are given by Vishwanathan 
and Smola (2004). 


Run-time. Suffix trees are well-known for their ability to enhance run-time performance of string 
algorithms (Gusfield, 1997). The advantage exploited herein is that a suffix tree comprises a 
quadratic amount of information, namely all suffixes, in a linear representation. Thus, a GST en- 
ables linear-time computation of similarity measures, even if a sequence x contains O (|x|?) words 
and the embedding language corresponds to L = A*. 


There are well-known algorithms for linear-time construction of suffix trees (e.g., Weiner, 1973; 
McCreight, 1976; Ukkonen, 1995), so that a GST for two sequences x and y can be constructed in 
O (|x| + |y]) using the concatenation z = x$,;y$. . As a GST contains at most 2|z| nodes, the worst- 
case run-time of any traversal is O(|z|) = O (|x| + |y|). Consequently, computation of similarity 
measures between sequences using a GST can be realized in time linear in the sequence lengths 
independent of the complexity of L. 


Implementation notes. In practice the GST algorithm may suffer from high memory consump- 
tion, due to storage of child nodes and suffix links. To alleviate this problem an alternative data 
structure with similar properties—suffix arrays—was proposed by Manber and Myers (1993). A 
suffix array is an integer array enumerating the suffixes of a sequence z in lexicographical order. It 
can be constructed in linear run-time, however, algorithms with super-linear run-time are surpris- 
ingly faster on real-world data (see Manzini and Ferragina, 2004; Maniscalco and Puglisi, 2007). 


Abouelhoda et al. (2004) propose a generic procedure for replacing suffix trees with enhanced 
suffix array, for example, as performed for the string kernel computation of Teo and Vishwanathan 
(2006). This procedure involves several auxiliary data structures for maintenance of child nodes and 
suffix links. In our implementation we follow a different approach based on the work of Kasai et al. 
(2001a) and Kasai et al. (2001b). Using a suffix array and an array of longest-common prefixes 
(LCP) for suffixes, we replace the traversal of the GST by looping over a generalized suffix array in 
linear time. 


Application of suffix arrays reduces memory requirements by a factor of 4. About 11|z| bytes 
are required for the modified GST algorithm: 8 bytes for a suffix and inverse suffix array, 2 bytes 
for sequence indices and on average 1 byte for an LCP array. In comparison, a suffix tree usu- 
ally requires over 40|z| bytes (Abouelhoda et al., 2004) and the enhanced suffix array of Teo and 
Vishwanathan (2006) about 19|z| bytes. 


36 


SIMILARITY MEASURES FOR SEQUENTIAL DATA 


5. Experiments and Applications 


In order to evaluate the run-time performance of the proposed comparison algorithms in practice and 
to investigate the effect of different similarity measures on sequential data, we conducted experi- 
ments on real world sequences. We chose nine data sets from the domains of bioinformatics, text 
processing and computer security. Details about each data set, contained sequences and references 
are given in Table 5. 























Name Sequence type # |Al |x|, Reference 
Bioinformatics 

ARTS DNA sequences 46794 4 2400 Sonnenburg et al. (2006) 
C.Elegans | DNA sequences 10025 4 10000 Wormbase WS120 
SPROT Protein sequences 150807 23 467 O’Donovan et al. (2002) 
Text processing 

Reuters News articles 19042 92 839 Lewis (1997) 

Heise News articles 30146 96 1800 www.heise.de 

RFC Text documents 4589 106 49954 www.rfc-editor.org 
Computer security 

HIDS System call traces 25979 83 156 Lippmann et al. (2000) 
NIDS Connection payloads 21330 116 1274 Lippmann et al. (2000) 
Spam Emails bodies 33702 176 1539 Metsis et al. (2006) 





Table 5: Data sets of sequences. The number of sequences in each set is denoted by #, the alphabet 
size by |A| and the average sequence length by |x|, 


5.1 Run-time Experiments 


The linear-time algorithms presented in Section 4 build on data structures of increasing complexity 
and capability—sorted arrays are simple but limited in capabilities, tries are more involved, yet 
they do not cover all embedding languages and generalized suffix trees are relatively complex and 
support the full range of embedding languages. In practice, however, it is the absolute and not 
asymptotic run-time that matters. Since the absolute run-time is affected by hidden constant factors, 
depending on design and implementation of an algorithm, it can only be evaluated experimentally. 


Therefore each algorithm was implemented using enhancements covered in implementation 
notes. In particular, for Algorithm 1 we incorporated 64-bit variables to realize a sorted 64-bit ar- 
ray, for Algorithm 2 we implemented a compact trie representation and for Algorithm 3 we used 
generalized suffix arrays in favor of suffix trees. For each of these algorithms we conducted experi- 
ments using different embedding languages to assess the run-time on the data sets given in Table 5. 


We applied the following experimental procedure and averaged run-time over 10 individual 
runs: 500 sequences are randomly drawn from a data set and a 500 x 500 matrix is computed for the 
Manhattan distance using a chosen embedding language. The run-time of the matrix computation is 
measured and reported in pairwise comparisons per second. Note, that due to the symmetry of the 
Manhattan distance only (n? +n)/2 comparisons need to be performed for an n x n matrix. 
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Figure 4: Run-time of sequences comparison over word k-grams for different algorithms. The x- 
axis gives the word k-gram lengths. The y-axis shows the number of comparison opera- 
tions per second in log-scale. 


Embedding language: bag-of-words. As a first embedding language, we consider the bag-of- 
words model. Since natural words can be defined only for textual data, we limit the focus of this 
experiment to text data sets in Table 5. In particular, we use the embedding language of word 
k-grams—covering the classic “bag of words” as word 1-grams—by using an alphabet of words 
instead of characters. Each symbol of the alphabet is stored in 32 bits, so that up to 2* different 
words can be represented. Experiments are conducted for values of k ranging from 1 to 8. 


Figure 4 shows the run-time performance of the implemented algorithms as a function of k on 
the Reuters, Heise and RFC data sets. The sorted array approach significantly outperforms the other 
algorithms on all data sets, yet it can only be applied for k < 2, as it is limited to 64 bits. For small 
values of k suffix arrays require more time for each comparison compared to compact tries, while 
for k > 5 their performance is similar to compact tries. This difference is explained by the number 
of unique k-grams vx in each sequence x. For small values of k often holds vx < |x|, so that a trie 
comparison requires O(k(vx + vy)) time in contrast to O(|x|+ |y|) for a suffix array. The worse 
run-time performance on the RFC data set is due to longer sequences. 


Embedding language: k-grams. For this experiment we focus on the embedding language of k- 
grams, which is not limited to a particular type of sequences, so that experiments were conducted for 
all data sets in Table 5. In contrast to the previous setup, k-grams are associated with the original 
alphabet of each data set: DNA bases and proteins for bioinformatics, characters for texts, and 
system calls and bytes for computer security. For each data set the value of k is varied from 1 to 19. 


The run-time as a function of k for each data set and algorithm is given in Figure 5. The sorted 
array approach again yields the best performance on all data sets. Moreover, the limitation of sorted 
arrays to 64 bits does not effect all data sets, so that for DNA all k-gram lengths can be computed. 
The suffix array slightly outperforms the trie comparison for larger value of k, as its worst-case 
run-time is independent of the length of k-grams. Absolute performance in terms of the number of 
comparisons per second differs among data sets due to different average sequence lengths. For data 
sets with short sequences (e.g., HIDS, ARTS) performance rates up to 10° comparisons per second 
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Comparison of k-grams (ARTS) 


Comparison of k-grams (C.Elegans) 


Comparison of k-grams (SPROT) 
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Figure 5: Run-time of sequences comparison over k-grams for different algorithms. The x-axis 
gives the k-gram lengths. The y-axis shows the number of comparison operations per 
second in log-scale. 
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are attainable, while for data sets with longer sequences (e.g., Spam, RFC) generally up to 10° — 104 
comparisons per second are achievable at best. 


5.2 Applications 


We now demonstrate that the ability of our approach to compute diverse similarity measures is 
beneficial in real applications, especially in unsupervised learning scenarios. Our experiments are 
performed for: (a) categorization of news articles, (b) intrusion detection in network traffic (c) tran- 
scription start site recognition in DNA sequences. 


Unsupervised text categorization. For this experiment news articles from the topics “Coffee”, 
“Interest”, “Sugar” and “Trade” in the Reuters data set are selected. The learning task is to cate- 
gorize these articles using clustering, without any prior knowledge of labels. As preprocessing we 
remove all stop words and words that occur in single articles only. We then compute dissimilar- 
ity matrices for the Euclidean, Manhattan and Jensen-Shannon distances using the bag-of-words 
embedding language as discussed in Section 3. The embedded articles are finally assigned to four 
clusters using complete linkage clustering (see Duda et al., 2001). 

Figure 6(a) shows projections of the embedded articles obtained from the dissimilarity matrices 
using multidimensional scaling (see Duda et al., 2001). Although projections are limited in describ- 
ing high-dimensional data, they visualize structure and, thus, help to interpret clustering results. 
For example, the projection of the Euclidean distances in Figure 6(a) noticeably differs in shape 
compared to the Manhattan and Jensen-Shannon distances. 

The cluster assignments are presented in Figure 6(b) and the distribution of topic labels among 
clusters is given in Figure 6(c). For the Euclidean distance the clustering fails to unveil features 
discriminative for article topics, as the majority of articles is assigned to a single cluster. In compar- 
ison, the Manhattan and Jensen-Shannon distance allow categorization of the topics “Coffee” and 
“Sugar”, due to observed high frequencies of respective words in articles. However, the Manhattan 
distance does no allow discrimination of the other two topics, as both are mixed among two clus- 
ters. The Jensen-Shannon distance enables better separation of all four topics. The topics “Coffee” 
and “Sugar” are almost perfectly assigned to clusters and the topics “Interest” and “Trade” each 
constitute the majority in a respective cluster. 


Network intrusion detection. Network intrusion detection aims to automatically identify hacker 
attacks in network traffic. As labels for such data are hard to obtain in practice, unsupervised 
learning has gained attention in intrusion detection research. The NIDS data set used for the run- 
time experiments (cf. Table 5) is known to contain major artifacts (see McHugh, 2000). In order 
to provide a fair investigation of the impact of various similarity measures on detection of attacks, 
we generated a smaller private data set. Normal traffic was recorded from the members of our 
laboratory by providing a virtual network. Additionally attacks were injected by a security expert. 

For this experiment we pursue an unsupervised learning approach to network intrusion detection 
(Rieck and Laskov, 2007). The incoming byte sequences of network connections are embedded 
using the language of 5-grams, and Zeta, an unsupervised anomaly detection method based on k- 
nearest neighbors, is applied over the following similarity measures: the Euclidean, Manhattan and 
Jensen-Shannon distance and the second Kulczynski coefficient (see Section 3.2). 

ROC curves for the detection of attacks in the network protocols HTTP, FTP and SMTP are 
shown in Figure 7. Application of the Jensen-Shannon distance and second Kulczynski coefficient 


40 


SIMILARITY MEASURES FOR SEQUENTIAL DATA 


Projection of Euclidean distances Projection of Manhattan distances Projection of Jensen—-Shannon distances 



















Interest 
Sugar 
Trade 



































(a) Projection of embedded articles with true topic label assignments 
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(c) Ratio of topic labels in clusters of embedded articles 


Figure 6: Clustering of Reuters articles using different similarity measures (bag-of-words). 


yield the highest detection accuracy. Over 78% of all attacks are identified with no false-positives 
in an unsupervised setup. In comparison, the Euclidean and Manhattan distance give significantly 
lower detection rates on the FTP and SMTP protocols. The poor detection performance of the latter 
two similarity measures emphasizes that choosing a discriminative similarity measure is crucial for 
achieving high accuracy in a particular application. 


Transcription start site recognition. The last application focuses on recognition of transcription 
start sites (TSS), which mark the beginning of genes in DNA. We consider the ARTS data set, which 
comprises human DNA sequences that either cover the TSS of protein coding genes or have been 
extracted randomly from the interior of genes. Following the approach of Sonnenburg et al. (2006) 
these sequences are embedded using the language of 6-grams and a support vector machine (SVM) 
and a bagged k-nearest neighbor classifier are trained and evaluated on the different partitions of 


41 


RIECK AND LASKOV 










































































Anomaly detection (HTTP traffic) Anomaly detection (FTP traffic) Anomaly detection (SMTP traffic) 
1 1 — yy 
4 0.7 
2 T 2 2 
© 06 4 g £06 
o o o 
2 2 2 
w 0.5 J D a 0.5 
Q [e] fe} 
a a a 
s 0.4 7 g s 0.4 
0.3 3 0.3 
0.2 -A Euclidean 0.2 -A Euclidean 0.2 -A Euclidean 
-© Manhattan -© Manhattan -© Manhattan 
0.1 ==- Jensen-Shannon |- 0.1 —y- Jensen-Shannon 0.1 ==- Jensen-Shannon 
= Kulczynski (2) = Kulczynski (2) = Kulczynski (2) 
0 0 0 
0 0.002 0.004 0.006 0.008 0.01 0 0.002 0.004 0.006 0.008 0.01 0 0.002 0.004 0.006 0.008 0.01 
false positive rate false positive rate false positive rate 


Figure 7: ROC curves for unsupervised anomaly detection on 5-grams of network connections and 
attacks recorded at our laboratory. 
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Figure 8: ROC curves for supervised and unsupervised recognition of transcription start sites (TSS) 
on 6-grams of DNA sequences (ARTS data set). 


the data set. Each method is evaluated for the Euclidean distance, the Manhattan distance and 
the second Kulczynski coefficient. As only 10% of the sequences in the data set correspond to 
transcription start sites, we additionally apply the unsupervised outlier detection method Gamma 
(Harmeling et al., 2006), which is similar to the method employed in the previous experiment. 


Figure 8 shows the performance achieved by the bagged k-nearest neighbor classifier and the 
unsupervised learning method.” The accuracy in both setups strongly depends on the chosen sim- 
ilarity measures. The metric distances yield better accuracy in the supervised setup. The second 
Kulczynski coefficient and also the Manhattan distance perform significantly better than the Eu- 
clidean distance in unsupervised application. In the absence of labels these measures express better 
discriminative properties for TSS recognition, that are difficult to access through Euclidean dis- 
tances. For the supervised application, the classification performance is limited for all similarity 





2. Results for the SVM are similar to the bagged k-nearest neighbor classifier and have been omitted. 
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measures, as only some discriminative features for TSS recognition are encapsulated in k-gram 
models (cf. Sonnenburg et al., 2006). 


6. Conclusions 


The framework for comparison of sequences proposed in this article provides means for efficient 
computation of a large variety of similarity measures, including kernels, distances and non-metric 
similarity coefficients. The framework is based on embedding of sequences in a high-dimensional 
feature space using formal languages, such as k-grams, contiguous subsequences, etc. Three im- 
plementations of the proposed framework using different data structures have been discussed and 
experimentally evaluated. 

Although all three data structures that were considered—sorted arrays, tries and generalized 
suffix trees—have asymptotically linear run-time, significant differences in the absolute run-time 
have been observed in our experiments. The constant factors are affected by various design issues 
illustrated by our remarks on implementation of the proposed algorithms. In general, we have 
observed a consistent trade-off between run-time efficiency and complexity of embedding languages 
a particular data structure can handle. Sorted arrays are the most efficient data structure; however, 
their applicability is limited to kK-grams and bag-of-words models. On the other end of the spectrum 
are generalized suffix trees (and their more space-efficient implementation using suffix arrays) that 
can handle unrestricted embedding languages—at a cost of more complicated algorithms and lower 
efficiency. The optimal data structure for computation of similarity measures thus depends on the 
embedding language to be used in a particular application. 

The proposed framework offers machine learning researchers an opportunity to use a large va- 
riety of similarity measures for applications that involve sequential data. Although an optimal sim- 
ilarity measure—as it is well known and has been also observed in our experiments—depends on 
a particular application, the technical means for seamless incorporation of various similarity mea- 
sures can be of great utility in practical applications of machine learning. Especially appealing is the 
possibility for efficient computation of non-Euclidean distances over embedded sequences, which 
extend applicable similarity measures for sequences beyond inner-products and kernel functions. 

In general, the proposed framework demonstrates an important advantage of abstracting data 
representation—in the form of pairwise relationships—from learning algorithms, which will hope- 
fully motivate further investigation of learning algorithms using a general form of such abstraction. 
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